PP-1.jpg

.

Introduction to Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

I have used the pandas, matplotlib and seaborn libraries for this particular project to load, clean, visualize and make sense of the dataset.¶

In [2]:
data=pd.read_csv("Data/train.csv")

Data Description

In [3]:
data.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

There are 1460 instances of training data. Total number of attributes equals 81, of which 36 is quantitative, 43 categorical, Id and SalePrice.¶

Quantitative: EnclosedPorch, Fireplaces,GrLivArea, HalfBath, KitchenAbvGr, LotArea, YearBuilt, YrSold¶

Qualitative: Alley, CentralAir, Electrical, GarageQual, GarageType, Heating, Neighborhood.¶

In [5]:
pd.set_option('display.max_columns', None)
data.describe()
Out[5]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1379.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 46.549315 567.240411 1057.429452 1162.626712 346.992466 5.844521 1515.463699 0.425342 0.057534 1.565068 0.382877 2.866438 1.046575 6.517808 0.613014 1978.506164 1.767123 472.980137 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 161.319273 441.866955 438.705324 386.587738 436.528436 48.623081 525.480383 0.518911 0.238753 0.550916 0.502885 0.815778 0.220338 1.625393 0.644666 24.689725 0.747315 213.804841 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 0.000000 0.000000 0.000000 334.000000 0.000000 0.000000 334.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 1900.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 0.000000 223.000000 795.750000 882.000000 0.000000 0.000000 1129.500000 0.000000 0.000000 1.000000 0.000000 2.000000 1.000000 5.000000 0.000000 1961.000000 1.000000 334.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 0.000000 477.500000 991.500000 1087.000000 0.000000 0.000000 1464.000000 0.000000 0.000000 2.000000 0.000000 3.000000 1.000000 6.000000 1.000000 1980.000000 2.000000 480.000000 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 0.000000 808.000000 1298.250000 1391.250000 728.000000 0.000000 1776.750000 1.000000 0.000000 2.000000 1.000000 3.000000 1.000000 7.000000 1.000000 2002.000000 2.000000 576.000000 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 1474.000000 2336.000000 6110.000000 4692.000000 2065.000000 572.000000 5642.000000 3.000000 2.000000 3.000000 2.000000 8.000000 3.000000 14.000000 3.000000 2010.000000 4.000000 1418.000000 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

The mean price of the houses are 182470.369733. The most costly house is worth 755000 and the cheapest one is worth 34900. However there are some irregularities in count in some of the columns.¶

Check and Clear The Null Values

In [6]:
pd.set_option('display.max_rows', None)
data.isnull().sum()
Out[6]:
Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinSF1          0
BsmtFinType2       38
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
GrLivArea           0
BsmtFullBath        0
BsmtHalfBath        0
FullBath            0
HalfBath            0
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive          0
WoodDeckSF          0
OpenPorchSF         0
EnclosedPorch       0
3SsnPorch           0
ScreenPorch         0
PoolArea            0
PoolQC           1453
Fence            1179
MiscFeature      1406
MiscVal             0
MoSold              0
YrSold              0
SaleType            0
SaleCondition       0
SalePrice           0
dtype: int64
In [7]:
data["Alley"]=data["Alley"].fillna("None")
data["MasVnrArea"]=data["MasVnrArea"].fillna("None")
data["BsmtExposure"]=data["BsmtExposure"].fillna("No Garage")
data["BsmtFinType1"]=data["BsmtFinType1"].fillna("No Garage")
data["BsmtFinType2"]=data["BsmtFinType2"].fillna("No Garage")
data["FireplaceQu"]=data["FireplaceQu"].fillna("NoFirePlace")
data["GarageType"]=data["GarageType"].fillna("None")
data["GarageFinish"]=data["GarageFinish"].fillna("None")
data["GarageQual"]=data["GarageQual"].fillna("None")
data["GarageCond"]=data["GarageCond"].fillna("None")
data["GarageYrBlt"]=data["GarageYrBlt"].fillna("None")
data["PoolQC"]=data["PoolQC"].fillna("NoPool")
data["Fence"]=data["Fence"].fillna("NoFence")
data["MiscFeature"]=data["MiscFeature"].fillna("None")

We found out there are some null values present in the dataset like in "Alley" column there are 1369 null values. However those null values represents something else like in "PoolQC" null values represents the residence have Zero number of pools. So we cannot drop all the rows with null values the better idea would be to replace the null values with "None" or any other meaningful values.¶

In [8]:
data=data.dropna()
In [9]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1163 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1163 non-null   int64  
 1   MSSubClass     1163 non-null   int64  
 2   MSZoning       1163 non-null   object 
 3   LotFrontage    1163 non-null   float64
 4   LotArea        1163 non-null   int64  
 5   Street         1163 non-null   object 
 6   Alley          1163 non-null   object 
 7   LotShape       1163 non-null   object 
 8   LandContour    1163 non-null   object 
 9   Utilities      1163 non-null   object 
 10  LotConfig      1163 non-null   object 
 11  LandSlope      1163 non-null   object 
 12  Neighborhood   1163 non-null   object 
 13  Condition1     1163 non-null   object 
 14  Condition2     1163 non-null   object 
 15  BldgType       1163 non-null   object 
 16  HouseStyle     1163 non-null   object 
 17  OverallQual    1163 non-null   int64  
 18  OverallCond    1163 non-null   int64  
 19  YearBuilt      1163 non-null   int64  
 20  YearRemodAdd   1163 non-null   int64  
 21  RoofStyle      1163 non-null   object 
 22  RoofMatl       1163 non-null   object 
 23  Exterior1st    1163 non-null   object 
 24  Exterior2nd    1163 non-null   object 
 25  MasVnrType     1163 non-null   object 
 26  MasVnrArea     1163 non-null   object 
 27  ExterQual      1163 non-null   object 
 28  ExterCond      1163 non-null   object 
 29  Foundation     1163 non-null   object 
 30  BsmtQual       1163 non-null   object 
 31  BsmtCond       1163 non-null   object 
 32  BsmtExposure   1163 non-null   object 
 33  BsmtFinType1   1163 non-null   object 
 34  BsmtFinSF1     1163 non-null   int64  
 35  BsmtFinType2   1163 non-null   object 
 36  BsmtFinSF2     1163 non-null   int64  
 37  BsmtUnfSF      1163 non-null   int64  
 38  TotalBsmtSF    1163 non-null   int64  
 39  Heating        1163 non-null   object 
 40  HeatingQC      1163 non-null   object 
 41  CentralAir     1163 non-null   object 
 42  Electrical     1163 non-null   object 
 43  1stFlrSF       1163 non-null   int64  
 44  2ndFlrSF       1163 non-null   int64  
 45  LowQualFinSF   1163 non-null   int64  
 46  GrLivArea      1163 non-null   int64  
 47  BsmtFullBath   1163 non-null   int64  
 48  BsmtHalfBath   1163 non-null   int64  
 49  FullBath       1163 non-null   int64  
 50  HalfBath       1163 non-null   int64  
 51  BedroomAbvGr   1163 non-null   int64  
 52  KitchenAbvGr   1163 non-null   int64  
 53  KitchenQual    1163 non-null   object 
 54  TotRmsAbvGrd   1163 non-null   int64  
 55  Functional     1163 non-null   object 
 56  Fireplaces     1163 non-null   int64  
 57  FireplaceQu    1163 non-null   object 
 58  GarageType     1163 non-null   object 
 59  GarageYrBlt    1163 non-null   object 
 60  GarageFinish   1163 non-null   object 
 61  GarageCars     1163 non-null   int64  
 62  GarageArea     1163 non-null   int64  
 63  GarageQual     1163 non-null   object 
 64  GarageCond     1163 non-null   object 
 65  PavedDrive     1163 non-null   object 
 66  WoodDeckSF     1163 non-null   int64  
 67  OpenPorchSF    1163 non-null   int64  
 68  EnclosedPorch  1163 non-null   int64  
 69  3SsnPorch      1163 non-null   int64  
 70  ScreenPorch    1163 non-null   int64  
 71  PoolArea       1163 non-null   int64  
 72  PoolQC         1163 non-null   object 
 73  Fence          1163 non-null   object 
 74  MiscFeature    1163 non-null   object 
 75  MiscVal        1163 non-null   int64  
 76  MoSold         1163 non-null   int64  
 77  YrSold         1163 non-null   int64  
 78  SaleType       1163 non-null   object 
 79  SaleCondition  1163 non-null   object 
 80  SalePrice      1163 non-null   int64  
dtypes: float64(1), int64(35), object(45)
memory usage: 745.0+ KB

After clearning and replacing the null values, the no.of instances of the training data reduces from 1460 to 1163 with no null values and ready for analysis.¶

In [10]:
data["FireplaceQu"].value_counts()
Out[10]:
NoFirePlace    571
Gd             319
TA             215
Fa              24
Ex              20
Po              14
Name: FireplaceQu, dtype: int64

Feature Engineering

In [11]:
data["RmsAbvGrd"]=data['FullBath'] + data['HalfBath'] + data['TotRmsAbvGrd']
In [12]:
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']
In [13]:
data['Age'] = data['YrSold'] - data['YearBuilt']
In [14]:
data['GarageCond'] = data['GarageCond'].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})

Here I have created some new columns like the Total rooms above ground (by adding total number of rooms full bathrooms and half bathrooms), Total Surface area and Age of the house. Also Change the GarageCondition values from string to rank numerics.¶

Exploratory Data Analysis(EDA)

1. Visualization of SalePrice(Target Variable(y))

In [15]:
data["SalePrice"].describe()
Out[15]:
count      1163.000000
mean     182470.369733
std       83251.315928
min       34900.000000
25%      129000.000000
50%      160000.000000
75%      215000.000000
max      755000.000000
Name: SalePrice, dtype: float64
In [16]:
sns.set(rc={'figure.figsize':(8,3)})
sns.histplot(data['SalePrice'], kde=True,color="orange")
Out[16]:
<AxesSubplot:xlabel='SalePrice', ylabel='Count'>

The sales price follows a bell curve and is noramlly distributed with mean price of 182470.369733 and Standard Deviation of 83251.315928. So we can assume that majority of house price should be between 100000 to 250000¶

2. Correlation of SalePrice with other Independent Variables

In [17]:
data.corr()['SalePrice']
Out[17]:
Id              -0.042740
MSSubClass      -0.096158
LotFrontage      0.350459
LotArea          0.313132
OverallQual      0.798380
OverallCond     -0.129849
YearBuilt        0.536357
YearRemodAdd     0.515019
BsmtFinSF1       0.387365
BsmtFinSF2      -0.023806
BsmtUnfSF        0.191331
TotalBsmtSF      0.624018
1stFlrSF         0.627775
2ndFlrSF         0.307785
LowQualFinSF    -0.025423
GrLivArea        0.705342
BsmtFullBath     0.220941
BsmtHalfBath    -0.033800
FullBath         0.582506
HalfBath         0.270098
BedroomAbvGr     0.168081
KitchenAbvGr    -0.109068
TotRmsAbvGrd     0.551535
Fireplaces       0.476748
GarageCars       0.649845
GarageArea       0.630398
GarageCond       0.253901
WoodDeckSF       0.336905
OpenPorchSF      0.314584
EnclosedPorch   -0.169488
3SsnPorch        0.038737
ScreenPorch      0.117900
PoolArea         0.093132
MiscVal         -0.031603
MoSold           0.049659
YrSold          -0.011652
SalePrice        1.000000
RmsAbvGrd        0.616317
TotalSF          0.776461
Age             -0.536046
Name: SalePrice, dtype: float64
In [18]:
sns.set(rc={'figure.figsize':(30,20)})
corr=data.corr()
sns.heatmap(corr, cmap='YlOrBr',  vmin=-1, vmax=1,annot=True, fmt='.2f')
Out[18]:
<AxesSubplot:>

Using the heatmap we visualize which variables are strongly or moderately correlated with the Sales Price. Now we will work with the independent variables with the highest correaltion¶

In [19]:
corr = data.corr()
strong_corr = corr['SalePrice'][abs(corr['SalePrice']) > 0.5]
sns.set(rc={'figure.figsize':(12,6)})
sns.heatmap(data[strong_corr.index].corr(), annot=True, cmap='YlOrBr')
Out[19]:
<AxesSubplot:>

Here the "Overall Quality of the house", "Living Area" and "Total Surface area" show very strong positive correlation with the price of the house¶

3. Relation between SalePrice and GrLivArea: Above ground living

In [20]:
data["GrLivArea"].corr(data["SalePrice"])
Out[20]:
0.705342254866112
In [21]:
sns.set(rc={'figure.figsize':(6,6)})
sns.scatterplot(data=data,x=data['GrLivArea'],y=data["SalePrice"], color="orange")
Out[21]:
<AxesSubplot:xlabel='GrLivArea', ylabel='SalePrice'>

As explained in the heatmap previously Living area is very strongly correlated with the SalesPrice and with the help of scatter plot we can now conclude that with increase in Living area the price of house will increase.¶

4. How does the sale price vary based on the overall quality of a house?

In [22]:
data.groupby('OverallQual')["SalePrice"].describe()
Out[22]:
count mean std min 25% 50% 75% max
OverallQual
2 3.0 51770.333333 14254.200796 35311.0 47655.50 60000.0 60000.0 60000.0
3 14.0 84391.071429 25435.982249 37900.0 69375.00 83000.0 102500.0 126175.0
4 86.0 106518.767442 25206.920495 34900.0 88367.75 108729.5 124275.0 176000.0
5 319.0 131741.313480 26604.858223 55993.0 117750.00 132000.0 145000.0 225000.0
6 281.0 156759.576512 34730.874739 76000.0 134000.00 155900.0 177000.0 259500.0
7 261.0 207037.544061 45552.404146 82500.0 179000.00 200000.0 230000.0 383970.0
8 141.0 277665.035461 64219.617119 122000.0 236500.00 270000.0 312500.0 538000.0
9 42.0 368810.952381 81811.134066 239000.0 319925.00 347500.0 392074.0 611657.0
10 16.0 435193.187500 169778.833404 160000.0 334375.00 414430.5 495000.0 755000.0
In [23]:
sns.boxplot(x=data['OverallQual'],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
Out[23]:
<AxesSubplot:xlabel='OverallQual', ylabel='SalePrice'>

Most house in the dataset have overall 5 rating (average). As the rating increase the house price will increase significantly. Highest Average Price of houses have a 10 rating including the most costly house of the dataset and Houses with 9 rating have the smallest standard deviation.¶

5. Exploring the distribution of the zoning classification and how it relates to the sale price.

In [24]:
data.groupby("MSZoning")["SalePrice"].describe()
Out[24]:
count mean std min 25% 50% 75% max
MSZoning
C (all) 10.0 74528.000000 33791.092031 34900.0 43998.25 74700.0 98332.0 133900.0
FV 54.0 218930.814815 55165.134293 144152.0 174500.00 208600.0 251645.0 370878.0
RH 12.0 129494.500000 39821.492590 76000.0 98200.00 131500.0 148608.5 200000.0
RL 892.0 194470.776906 84812.231061 52000.0 138000.00 175000.0 228125.0 755000.0
RM 195.0 126275.020513 50219.368150 37900.0 100000.00 120000.0 138250.0 475000.0
In [25]:
sns.boxplot(data=data,x=data["MSZoning"],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
Out[25]:
<AxesSubplot:xlabel='MSZoning', ylabel='SalePrice'>

Most number of houses are in zones with Low Density and Floating Village have the highest mean price.¶

6. Does alley access have any effect on the Sales Price?

In [26]:
print(data.groupby("Alley")["SalePrice"].median())
data.groupby("Alley")["SalePrice"].describe()
Alley
Grvl    122000.0
None    164000.0
Pave    172200.0
Name: SalePrice, dtype: float64
Out[26]:
count mean std min 25% 50% 75% max
Alley
Grvl 46.0 122591.391304 35398.503360 52500.0 105375.0 122000.0 136750.0 256000.0
None 1079.0 185584.070436 84827.911788 34900.0 130000.0 164000.0 222000.0 755000.0
Pave 38.0 166542.736842 38590.669108 40000.0 151250.0 172200.0 184750.0 265979.0
In [27]:
sns.set(rc={'figure.figsize':(4,6)})
sns.boxplot(data=data,x=data["Alley"],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
Out[27]:
<AxesSubplot:xlabel='Alley', ylabel='SalePrice'>

House with no access to any alley have the highest average price followed by paved alley and gravel alley. The most expensive house in dataset have no alley access¶

7. What is the relationship between the total number of rooms in a house and its sale price?

In [28]:
data.groupby('RmsAbvGrd')['SalePrice'].describe()
Out[28]:
count mean std min 25% 50% 75% max
RmsAbvGrd
4 13.0 97093.153846 35262.858071 35311.0 80000.0 86000.0 131500.00 155900.0
5 53.0 105874.018868 25237.549014 34900.0 89000.0 110000.0 125000.00 141000.0
6 153.0 124122.000000 35364.469850 52000.0 109900.0 123000.0 134800.00 385000.0
7 209.0 144889.875598 46345.500199 37900.0 123500.0 140000.0 160000.00 392500.0
8 203.0 176919.064039 53750.820659 75000.0 141500.0 170000.0 203500.00 374000.0
9 190.0 198288.742105 66909.020020 81000.0 153925.0 181000.0 234542.50 440000.0
10 136.0 208172.411765 69842.240489 87000.0 159000.0 189000.0 232750.00 423000.0
11 100.0 226486.750000 69706.048326 110000.0 179530.0 215000.0 264076.25 501837.0
12 50.0 256018.180000 96229.603393 90000.0 175500.0 255750.0 310250.00 555000.0
13 31.0 288718.838710 91848.880078 106000.0 249000.0 309000.0 346321.50 451950.0
14 13.0 441257.307692 189804.109524 154300.0 320000.0 424870.0 611657.00 755000.0
15 7.0 322690.142857 173431.890561 160000.0 190875.0 200500.0 480000.00 556581.0
16 5.0 354186.600000 190075.080160 200000.0 200000.0 250000.0 538000.00 582933.0
In [29]:
data['RmsAbvGrd'].corr(data['SalePrice'])
Out[29]:
0.6163169162341261
In [30]:
sns.set(rc={'figure.figsize':(8,6)})
sns.boxplot(x=data['RmsAbvGrd'],y=data['SalePrice'],palette=['#FF9500', '#ffb30a'])
Out[30]:
<AxesSubplot:xlabel='RmsAbvGrd', ylabel='SalePrice'>

The Price of a House is strongly correlated with the Number of Rooms and Bathrooms present.The most number of rooms in a house is 16, however most expensive house of the dataset(Price="755000.0") have 14 Total Rooms. Average Price of a House with 14 rooms is highest and House with 4 rooms is Lowest¶

8. What is the most common type of foundation in the dataset, and how does it relate to the sale price?

In [31]:
data.groupby('Foundation')['SalePrice'].describe()
Out[31]:
count mean std min 25% 50% 75% max
Foundation
BrkTil 137.0 132987.934307 55424.527582 37900.0 102000.0 127000.0 153500.00 475000.0
CBlock 486.0 144841.370370 45765.603062 34900.0 119500.0 139000.0 160000.00 402861.0
PConc 532.0 229883.518797 90676.888880 78000.0 174000.0 209750.0 270000.00 755000.0
Stone 6.0 165959.166667 78557.704413 102776.0 109625.0 126500.0 233734.25 266500.0
Wood 2.0 153500.000000 14849.242405 143000.0 148250.0 153500.0 158750.00 164000.0
In [32]:
sns.barplot(x=data["Foundation"],y=data["SalePrice"],data=data,palette=['#FF9500', '#ffb30a'])
Out[32]:
<AxesSubplot:xlabel='Foundation', ylabel='SalePrice'>

So, the most commonly used Foundations for a house are "Poured Contrete", "Cinder Block" and "Brick Tile". The most expensive type of Foundation for a house is "Poured Contrete" followed by "Wood" and "Stone". The most expensive house (Price="755000.0") is built with "Poured Contrete" and the cheapest house in the dataset (Price="34900.0") is made with "Cinder Block".¶

9. What is the relationship between the area of the basement and the sale price of a house?

In [33]:
print(f"Correlation: {data['TotalBsmtSF'].corr(data['SalePrice'])}")
sns.scatterplot(x=data['TotalBsmtSF'], y=data['SalePrice'], color="orange")
Correlation: 0.6240176156843997
Out[33]:
<AxesSubplot:xlabel='TotalBsmtSF', ylabel='SalePrice'>

The area of basement and Price of the house shows strong positive correlation as shown in the Scatterplot that means we can conclude a larger area of the basement adds more value to the house and increases its sale price.¶

10. Which neighborhoods have the highest average sales price, and how do they compare to each other?

In [34]:
data.groupby("Neighborhood")["SalePrice"].describe()
Out[34]:
count mean std min 25% 50% 75% max
Neighborhood
Blmngtn 14.0 191790.357143 30947.405296 159895.0 172875.0 186000.0 193775.75 264561.0
Blueste 2.0 137500.000000 19091.883092 124000.0 130750.0 137500.0 144250.00 151000.0
BrDale 16.0 104493.750000 14330.176493 83000.0 91000.0 106000.0 118000.00 125000.0
BrkSide 49.0 127399.489796 38475.963983 52000.0 105000.0 127500.0 141500.00 223500.0
ClearCr 12.0 212166.666667 56660.204623 130000.0 182750.0 205750.0 240375.00 328000.0
CollgCr 125.0 197625.328000 50108.821490 110000.0 164990.0 196500.0 221500.00 424870.0
Crawfor 41.0 203298.463415 69563.313243 90350.0 154900.0 188700.0 239000.00 392500.0
Edwards 83.0 130343.373494 41173.733466 58500.0 107250.0 126000.0 146700.00 320000.0
Gilbert 49.0 194059.306122 41871.184636 141000.0 175000.0 180000.0 192000.00 377500.0
IDOTRR 33.0 99093.333333 31962.801367 34900.0 81000.0 103000.0 120500.00 152000.0
MeadowV 15.0 98486.666667 24395.721717 75000.0 82750.0 88000.0 112000.00 151400.0
Mitchel 34.0 153886.352941 35947.821599 84500.0 130937.5 145000.0 170750.00 240000.0
NAmes 178.0 147864.511236 33021.925742 97000.0 129000.0 140000.0 159375.00 345000.0
NPkVill 7.0 144642.857143 8754.590633 127500.0 143000.0 147000.0 148500.00 155000.0
NWAmes 45.0 184211.111111 35510.936038 82500.0 165000.0 179900.0 202500.00 278000.0
NoRidge 33.0 337412.363636 130116.489801 190000.0 265000.0 290000.0 341000.00 755000.0
NridgHt 75.0 316214.506667 94665.046255 154000.0 253646.5 315000.0 373201.00 611657.0
OldTown 107.0 128643.074766 53870.859031 37900.0 105450.0 119900.0 140000.00 475000.0
SWISU 23.0 141947.130435 33859.866674 60000.0 127500.0 137450.0 164217.00 200000.0
Sawyer 46.0 136666.130435 20389.995805 62383.0 129000.0 135000.0 146975.00 180000.0
SawyerW 45.0 189342.600000 56601.924579 76000.0 153337.0 179900.0 228000.00 320000.0
Somerst 75.0 230586.880000 57637.330120 144152.0 184450.0 229456.0 258750.00 423000.0
StoneBr 20.0 335523.750000 111762.680173 180000.0 248000.0 322000.0 392608.00 556581.0
Timber 29.0 252498.379310 66764.973332 170000.0 187500.0 248900.0 305000.00 378500.0
Veenker 7.0 258714.285714 74470.272113 181500.0 202500.0 245500.0 297000.00 385000.0
In [35]:
sns.set(rc={'figure.figsize':(30,10)})
sns.barplot(x=data["Neighborhood"],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
Out[35]:
<AxesSubplot:xlabel='Neighborhood', ylabel='SalePrice'>

North Ridge region have the highest avegare sale price whereas the Meadow Village have the lowest average sale price. North Ames have the most number house on sale. North Ridge has the most expensive house(Price=755000.0) and Iowa DOT and Rail Road have the cheapest house(Price=34900.0) of the dataset.¶

11. How does the number of fireplaces in a house affect its sale price?

In [36]:
data.groupby("Fireplaces")["SalePrice"].describe()
Out[36]:
count mean std min 25% 50% 75% max
Fireplaces
0 571.0 141983.000000 44934.587316 34900.0 114502.0 135000.0 163750.0 342643.0
1 511.0 216776.514677 83977.283578 60000.0 158000.0 192000.0 262640.0 625000.0
2 77.0 250622.701299 133380.484825 110000.0 160000.0 210000.0 315000.0 755000.0
3 4.0 267500.000000 100041.657990 160000.0 193750.0 275000.0 348750.0 360000.0
In [37]:
data["Fireplaces"].corr(data["SalePrice"])
Out[37]:
0.47674796573225514
In [38]:
sns.set(rc={'figure.figsize':(8,5)})
sns.boxplot(x=data["Fireplaces"], y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
Out[38]:
<AxesSubplot:xlabel='Fireplaces', ylabel='SalePrice'>

There is a moderate positive correaltion between the number of Fireplaces and Price. Only 5 Houses have more than 3 Fireplaces and the Average Price of the houses are higher than the other houses. However The Most Expensive house of the dataset have only 2 Fireplaces so there must be other strong factors that affect the price¶

12. How does Price gets affected by Garage Area and Condition?

In [39]:
data["GarageArea"].corr(data["SalePrice"])
Out[39]:
0.6303980450640168
In [40]:
sns.scatterplot(x=data["GarageArea"],y=data["SalePrice"], color="orange")
Out[40]:
<AxesSubplot:xlabel='GarageArea', ylabel='SalePrice'>
In [41]:
print(data.groupby('GarageCond')["SalePrice"].describe())
plt.figure(figsize=(8, 10))
sns.boxplot(x='GarageCond', y='SalePrice', data=data,palette=['#FF9500', '#ffb30a'])
             count           mean           std       min       25%       50%  \
GarageCond                                                                      
0             67.0  106300.746269  33018.934527   34900.0   84500.0  104000.0   
1              6.0  106416.666667  24079.901716   67000.0   98875.0  107000.0   
2             31.0  112625.516129  31615.160605   40000.0   93000.0  114504.0   
3           1052.0  190078.516160  83331.045428   35311.0  134500.0  170000.0   
4              5.0  150080.000000  26745.223873  118500.0  132000.0  145000.0   
5              2.0  124000.000000   4949.747468  120500.0  122250.0  124000.0   

                 75%       max  
GarageCond                      
0           127500.0  200500.0  
1           120000.0  137000.0  
2           137250.0  177000.0  
3           225000.0  755000.0  
4           174900.0  180000.0  
5           125750.0  127500.0  
Out[41]:
<AxesSubplot:xlabel='GarageCond', ylabel='SalePrice'>

As we can see the area of Garage is strongly correlated with Price and from the scatterplot we can see a positive slop. So, one can conclude that with increase in garage area the price of house will increase by some amount.¶

The Condition of the Garage is also an Important factor here as well. Most of the Houses have 3 rating or Avegare Rating and have the highest avegare price.¶

13. How Age of House change the Sales Price?

In [42]:
data['Age'].describe()
Out[42]:
count    1163.000000
mean       37.067928
std        31.993217
min         0.000000
25%         6.000000
50%        35.000000
75%        58.000000
max       136.000000
Name: Age, dtype: float64
In [43]:
sns.regplot(x=data['Age'],y=data['SalePrice'], color="#fb8b2ded")
Out[43]:
<AxesSubplot:xlabel='Age', ylabel='SalePrice'>

Average age of the houses in the dataset is 37. The Regression Scatterplot show a Negetive Correlation between the age and Price which means the price of houses will decrease over time.¶

14. How Total Area of the House change the Price of a House?

In [44]:
data["TotalSF"].describe()
Out[44]:
count     1163.000000
mean      2596.554600
std        821.976781
min        728.000000
25%       2038.000000
50%       2491.000000
75%       3034.000000
max      11752.000000
Name: TotalSF, dtype: float64
In [45]:
sns.scatterplot(x=data["TotalSF"],y=data["SalePrice"], color="orange")
Out[45]:
<AxesSubplot:xlabel='TotalSF', ylabel='SalePrice'>

Total Surface Area of the House which includes all 1st, 2nd and Basement floor in Square feet is Strongly Correlated with the Price. Average size of the houses are 2596sq.ft and The house with biggest area indoor is 11752sq.ft big.¶

15. Which is most used Electric System?

In [46]:
data.groupby('Electrical')['SalePrice'].describe()
Out[46]:
count mean std min 25% 50% 75% max
Electrical
FuseA 76.0 121633.000000 37501.718480 34900.0 101750.0 123750.0 144125.0 239000.0
FuseF 19.0 110759.842105 28164.370151 58500.0 90875.0 122000.0 130750.0 145000.0
FuseP 2.0 105000.000000 45254.833996 73000.0 89000.0 105000.0 121000.0 137000.0
Mix 1.0 67000.000000 NaN 67000.0 67000.0 67000.0 67000.0 67000.0
SBrkr 1065.0 188345.065728 83905.356850 37900.0 132500.0 169000.0 224900.0 755000.0
In [47]:
sns.barplot(x="Electrical",y="SalePrice",data=data,palette=['#FF9500', '#ffb30a'])
Out[47]:
<AxesSubplot:xlabel='Electrical', ylabel='SalePrice'>

Most commonly used elelctric system is Standard Circuit Breakers & Romex , The most expensive house of the dataset also uses the electric system and the average value of the system is higher than the rest. Mix electric system is used in only one house.¶

.

CONCLUSION

The sales price follows a bell curve and is noramlly distributed with mean price of 182470.369733 and Standard Deviation of 83251.315928. So we can assume that majority of house price should be between 100000 to 250000.

Here the "Overall Quality of the house", "Living Area" and "Total Surface area" show very strong positive correlation with the price of the house

Most house in the dataset have overall 5 rating (average). As the rating increase the house price will increase significantly. Highest Average Price of houses have a 10 rating including the most costly house of the dataset and Houses with 9 rating have the smallest standard deviation.

Most number of houses are in zones with Low Density and Floating Village have the highest mean price.

House with no access to any alley have the highest average price followed by paved alley and gravel alley. The most expensive house in dataset have no alley access

The Price of a House is strongly correlated with the Number of Rooms and Bathrooms present.The most number of rooms in a house is 16, however most expensive house of the dataset(Price="755000.0") have 14 Total Rooms. Average Price of a House with 14 rooms is highest and House with 4 rooms is Lowest.

The most commonly used Foundations for a house are "Poured Contrete", "Cinder Block" and "Brick Tile". The most expensive type of Foundation for a house is "Poured Contrete" followed by "Wood" and "Stone". The most expensive house (Price="755000.0") is built with "Poured Contrete" and the cheapest house in the dataset (Price="34900.0") is made with "Cinder Block".

The area of basement and Price of the house shows strong positive correlation as shown in the Scatterplot that means we can conclude a larger area of the basement adds more value to the house and increases its sale price.

North Ridge region have the highest avegare sale price whereas the Meadow Village have the lowest average sale price. North Ames have the most number house on sale. North Ridge has the most expensive house(Price=755000.0) and Iowa DOT and Rail Road have the cheapest house(Price=34900.0) of the dataset.

There is a moderate positive correaltion between the number of Fireplaces and Price. Only 5 Houses have more than 3 Fireplaces and the Average Price of the houses are higher than the other houses. However The Most Expensive house of the dataset have only 2 Fireplaces so there must be other strong factors that affect the price.

The area of Garage is strongly correlated with Price and from the scatterplot we can see a positive slop. So, one can conclude that with increase in garage area the price of house will increase by some amount. The Condition of the Garage is also an Important factor here as well. Most of the Houses have 3 rating or Avegare Rating and have the highest avegare price.

Average age of the houses in the dataset is 37. The Regression Scatterplot show a Negetive Correlation between the age and Price which means the price of houses will decrease over time.

Total Surface Area of the House which includes all 1st, 2nd and Basement floor in Square feet is Strongly Correlated with the Price. Average size of the houses are 2596sq.ft and The house with biggest area indoor is 11752sq.ft big.

Most commonly used elelctric system is Standard Circuit Breakers & Romex , The most expensive house of the dataset also uses the electric system and the average value of the system is higher than the rest. Mix electric system is used in only one house.

Reference

Pandas documentation:¶

https://pandas.pydata.org/docs/

Matplotlib documentation:¶

https://matplotlib.org/stable/index.html

Seaborn documentation:¶

https://seaborn.pydata.org/

blog - What is Exploratory Data Analysis? :¶

https://towardsdatascience.com/exploratory-data-analysis%028fc1cb20fd15

blog – Exploratory Data Analysis and Visualization Techniques:¶

https://www.analyticsvidhya.com/blog/2021/08/exploratory%02data-analysis-and-visualization-techniques-in-data-

Appendix

Here is the drive link containing the dataset & the Jupiter notebook code:¶

https://drive.google.com/drive/folders/1GCVMY7oZ53AMDs_tN3nzUviSpf1TvxlA?usp=sharing

1091916.jpg